Take-home Exercise 3

Take-home Exercise

Putting Visual Analytics into Practical Use: VAST Challenge 2022, Challenge 3 Economic.

Leslie Long Nu https://www.linkedin.com/in/leslielongnu/ (SMU, MITB)https://scis.smu.edu.sg/master-it-business
2022-05-16

1. Overview

1.1 Introduction

In this take-home exercise, the economic of the city of Engagement, Ohio USA will be revealed by using appropriate static and interactive statistical graphics methods.

With reference to Challenge 3 Question 1 of VAST Challenge 2022, the following questions will be addressed:

Over the period covered by the dataset, which businesses appear to be more prosperous? Which appear to be struggling? Describe your rationale for your answers.

1.2 Methodology

It is observed from the datasets provided by VAST Challenge 2022 that there are three types of businesses present in Engagement, Ohio USA, namely:

In this exercise, the robustness of different types of businesses will be evaluated by different criteria as the data available is different.

Workplaces

For workplaces, data is available on employees, jobs provided, wages, educational level requirement and etc. For restaurants and pubs, data is available on prices, customers’ visits, spending and etc. Therefore, in this exercise, workplaces will be evaluated base on two main criteria:

Restaurants and Pubs

On the other hand, restaurants and pubs will be evaluated based on:

According to the dataset descriptions provided by VAST Challenge, all restaurants have a Prix Fixe food cost for participants to dine in and all pubs have a hourly cost to visit the pub. Therefore, assuming all visits to restaurants are for dinning, restaurants’ revenue will be calculated by number of visits times Prix Fixe food cost. Similarly, pubs’ revenue will be calculated by duration of visits times hourly cost of visits.

Alternative approach of deriving balance difference before and after restaurants and pubs visits as spending is considered but not preferred as balance differences are inconsistent and could be due to unknown reasons.

2. Data Preparation

2.1 Install and Load Packages

The following code chunk installs the required R packages and loads them onto RStudio environment.

show
packages = c('ggiraph', 'plotly', 'DT', 'patchwork',
             'gganimate', 'tidyverse','readxl', 'gifski', 
             'gapminder', 'treemap', 'treemapify', 'rPackedBar',
             'trelliscopejs', 'zoo', 'd3treeR', 'ggridges')
for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p,character.only = T)
}

2.2 Import Raw Datasets

Relevant datasets are imported using read_csv() of readr package, which is useful for reading delimited files into tibbles.

show
jobs <- read_csv('rawdata/Jobs.csv')
pubs <- read_csv('rawdata/Pubs.csv')
restaurants <- read_csv('rawdata/Restaurants.csv')
travel <- read_csv('rawdata/TravelJournal.csv')

The following code chunk is used to have an overview of the datasets.

show
summary(jobs)
summary(pubs)
summary(restaurants)
summary(travel)

2.3 Data Wrangling

2.3.1 Workplace

File jobs is cleaned by renaming values for ease of reading. A new file jobsedu is created using group_by() to reveal data on jobs offered for different education requirements.

show
jobs$educationRequirement <- sub('HighSchoolOrCollege', 
                                    'High School or College',
                                    jobs$educationRequirement)

2.3.2 Restaurants and Pubs

The following code chunk extracts travel records related to restaurants and pubs using filter() and derives spending of each visit using inner_join() and mutate().

Datasets are also cleaned by removing irrelevant columns using select() and renaming column names using rename() for ease of understanding.

show
restaurantstr <- travel %>%
  filter(purpose == 'Eating') %>% 
  mutate(travelTime = travelEndTime - travelStartTime) %>%
  select(-c(travelStartTime:travelEndTime, endingBalance)) %>%
  inner_join(y= restaurants, 
            by = c('travelEndLocationId'= 'restaurantId')) %>%
  mutate(visitDuration = checkOutTime - checkInTime) %>%
  select(-c(purpose, location, checkOutTime)) %>%
  rename('restaurantId' = 'travelEndLocationId',
         'spending' = 'foodCost')
  

pubstr <- travel %>%
  filter(purpose == 'Recreation (Social Gathering)') %>%
  mutate(travelTime = travelEndTime - travelStartTime) %>%
  select(-c(travelStartTime: travelEndTime, endingBalance)) %>%
  inner_join(y= pubs, 
             by = c('travelEndLocationId'= 'pubId')) %>%
  mutate(visitDuration = checkOutTime - checkInTime,
         spending = as.numeric(visitDuration/60)* hourlyCost) %>%
  select(-c(purpose, location, checkOutTime)) %>%
  rename('pubId' = 'travelEndLocationId')

The following code chunk is used to check for missing values.

show
sum(is.na(restaurantstr))
sum(is.na(pubstr))

2.4 Save as and Read RDS Files

The cleaned datasets are saved and read in RDS format to avoid uploading large files to Git.

show
saveRDS(jobs, 'data/jobs.rds')
saveRDS(jobsedu, 'data/jobsedu.rds')
saveRDS(restaurantstr, 'data/restaurants.rds')
saveRDS(pubstr, 'data/pubs.rds')
show
jobs <- readRDS('data/jobs.rds')
jobsedu <- readRDS('data/jobsedu.rds')
restaurants <- readRDS('data/restaurants.rds')
pubs <- readRDS('data/pubs.rds')

3. Visualizations and Insights

3.1 Workplaces

3.1.1 Number of Jobs Provided by Each Workplace

The following interactive dotplot shows that the size of the workplaces in Engagement, Ohio ranges from 2 to 9 employees. From the interactive tooltip label on hovering, we are able to tell which business are hiring more employees and vice versa. While a bigger-sized business does not guarantee to be more prosperous than a smaller-sized business or a start-up, it provides an overview of the status of the workplaces in Engagement and lays the ground for further analysis on employee wage.

show
jobsnum <- jobs %>% 
  group_by(employerId) %>%
  summarise(jobNum = n(),
            totalPay = sum(hourlyRate),
            avgPay = mean(hourlyRate))

tooltip_css <- 'background-color: #008080;
font-stype: bold; color: white'
jobsnum$tooltip <- c(paste0('Employer ID: ', jobsnum$employerId,
                         '\n Number of Employees: ', jobsnum $jobNum))

p <- ggplot(data = jobsnum, aes(x= jobNum)) +
  geom_dotplot_interactive(aes(tooltip = tooltip),
                           fill = '#bada55',
                           stackgroups = TRUE,
                           binwidth = 0.1,
                           color = NA,
                           method = 'histodot') +
  scale_y_continuous(NULL, breaks = NULL) + 
  scale_x_continuous(limits = c(1, 10),
                     breaks = c(1,2,3,4,5,6,7,8,9,10), 
                     labels = c(1,2,3,4,5,6,7,8,9,10)) +
  labs(x= 'Number of Employees',
       title = "How Many Jobs Is Each Workplace Provding?",
       subtitle= 'Economic in Engagement, Ohio',
       caption = "Source: VAST Challenge 2022") +
  theme(panel.grid.major = element_line(color= 'grey', size = 0.1),
        panel.background= element_blank(), 
        axis.line= element_line(color= 'grey'),
        plot.caption = element_text(hjust = 0))

girafe(ggobj = p,
       width_svg = 8,
       height_svg = 8*0.618,
       options = list(opts_tooltip(css = tooltip_css)))

3.1.2 Employee Hourly Pay by Workplace

The following code chunk first plots a static tree map and subsequently converted to an interactive tree map based on the total and average hourly employee pay provided by each workplace.

show
jobsnum <- jobsnum %>%
  rename('Average Hourly Pay' = 'avgPay') %>%
  mutate(group = paste(jobNum, 'Employees'))

tm <- treemap(jobsnum,
            index = c('group', 'employerId'),
            vSize = 'totalPay',
            vColor = 'Average Hourly Pay',
            type = 'value',
            title = 'Employee Wage by Workplace')

The size of the tree map is based on the total employee hourly pay provided by one workplace and that of the color is based on the average employee pay. Therefore, employers displayed with darker colors and bigger sizes are generally more prosperous and employers with lighter shades and smaller sizes are more likely to be struggling.

It is observed from the plot that some employers, although of a smaller size, are paying high average rate to their employees. This indicates that these businesses are more prosperous as they can afford higher manpower costs. On the other hand, smaller companies that are also paying low rates for their employees are likely to be struggling, because it indicates that they are either low on funding or they will lose talents to other businesses and cost business development in the long run.

The tree map allows us to look at both the overall prosperous and struggling businesses for each size of the workplace, as well as the overall proportion of the business that is contributing to job provision based on the relative size and shade of the shape.

show
d3tree(tm, rootname = 'Employee Hourly Wage by Workplace')

3.1.3 Job Pay by Education Requirement

Next, the average workplace employee wage based on different education levels are examined. Packed bar plots are created and the top 10 employers offering highest average rates are selected to view.

Before plotting, the follow code chunk prepares four different datasets on jobs of different education requirements.

show
jobsedu <- jobs %>%
  group_by(employerId, educationRequirement) %>%
  summarise(jobnum = n(),
            avgHourlyPay = round(mean(hourlyRate),2),
            totalHourlyPay = sum(hourlyRate)) %>%
  rename('Average Hourly Pay' = 'avgHourlyPay') 

jobsedu1 <- filter(jobsedu, educationRequirement=='Low')
jobsedu2 <- filter(jobsedu, educationRequirement=='High School or College')
jobsedu3 <- filter(jobsedu, educationRequirement=='Bachelors')
jobsedu4 <- filter(jobsedu, educationRequirement=='Graduate')

The following code chunks plot four packed bar charts of average hourly pay for jobs of different education requirements. These charts reveal businesses that are paying more for jobs of the same education requirements. We can also tell how much more businesses are paying by comparing the length of each bar. It is observed that there are almost no competing workplaces in four of the top 10 paying lists, possibly due to different natures of the business.

show
p1 <- plotly_packed_bar(input_data = jobsedu1, 
                       label_column = 'employerId',
                       value_column = 'Average Hourly Pay',
                       number_rows = 10,
                       plot_title = 'Top 10 Paying Workplaces for Low Education Jobs', 
                       xaxis_label = 'Average Hourly Pay',
                       hover_label = 'Average Hourly Pay',
                       min_label_width = 0.001,
                       color_bar_color = '#ffa500',
                       label_color = 'white') 

plotly::config(p1, displayModeBar = FALSE)
show
p2 <- plotly_packed_bar(input_data = jobsedu2, 
                       label_column = 'employerId',
                       value_column = 'Average Hourly Pay',
                       number_rows = 10,
                       plot_title = 'Top 10 Paying Workplaces for High School and College Education Jobs', 
                       xaxis_label = 'Average Hourly Pay',
                       hover_label = 'Average Hourly Pay',
                       min_label_width = 0.001,
                       color_bar_color = '#66cdaa',
                       label_color = 'white') 
plotly::config(p2, displayModeBar = FALSE)
show
p3 <- plotly_packed_bar(input_data = jobsedu3, 
                       label_column = 'employerId',
                       value_column = 'Average Hourly Pay',
                       number_rows = 10,
                       plot_title = 'Top 10 Paying Workplaces for Bachelor Education Jobs',
                       xaxis_label = 'Average Hourly Pay',
                       hover_label = 'Average Hourly Pay',
                       min_label_width = 0.001,
                       color_bar_color = '#bada55',
                       label_color = 'white') 
plotly::config(p3, displayModeBar = FALSE)
show
p4 <- plotly_packed_bar(input_data = jobsedu4, 
                       label_column = 'employerId',
                       value_column = 'Average Hourly Pay',
                       number_rows = 10,
                       plot_title = 'Top 10 Workplaces for Graduate Education Jobs',
                       xaxis_label = 'Average Hourly Pay',
                       hover_label = 'Average Hourly Pay',
                       min_label_width = 0.001,
                       color_bar_color = '#008080',
                       label_color = 'white') 
plotly::config(p4, displayModeBar = FALSE)

3.2 Restaurants and Pubs

3.2.1 Restaurants Monthly Customer Visits

The following code chunk plots the monthly customer visits bar chart for each restaurant using treliscopejs. By viewing sort by visits, revenue or price, we are able to identify the restaurants that are prospering or struggling. While profit margin information is not available, as the price of the restaurants do not vary significantly, it is safe to assume that those with high customer visits and high revenue are doing well and vice versa.

A overview of the visualizations show that restaurants tend to get relatively consistent number of the customer visits each month, some restaurants, such as 448 and 895 are seeing a decreasing trend of customer visits overall. Quite a few restaurants, such as 1801, 1805, 449 and etc are observed to have a decreasing trend of customer visits from Mar to May 2023.

Based on overall customer visits and revenue, these restaurants are prospering: restaurant ID 1801, 449, 1805 and 447. These restaurants are struggling: restaurant ID 1347, 445, 1346 and 1349.

show
restaurants <- restaurants %>%
  mutate(yearmonth = as.yearmon(checkInTime)) 
restaurantsv <- restaurants %>%
  group_by(restaurantId, yearmonth) %>%
  summarise(visits = n(),
            revenue = sum(spending),
            price = mean(spending))

r <- ggplot(restaurantsv, aes(x= as.factor(yearmonth), y= visits)) +
  geom_col(fill= '#008080') +
  labs(x= 'Month Year', y= 'Number of\nCustomer\nVisits',
       title = 'Monthly Customer Visits by Restaurant') +
  facet_trelliscope(~ restaurantId, 
                    nrow = 2, ncol = 2, width = 800,
                    path = 'trellisr/',
                    self_contained = TRUE) +
  theme(axis.title.y= element_text(angle=0), 
        axis.text.x = element_text(angle = 30, vjust = 0.5, hjust = 0.3),
        axis.ticks.x= element_blank(),
        panel.background= element_blank(), 
        axis.line= element_line(color= 'grey'))
r

3.2.2 Pubs Monthly Customer Visits

Similarly, the monthly customer visits bar chart for pubs are also plotted using trelliscopejs. By viewing sort by visits, revenue or price, we are able to identify the pubs that are prospering or struggling.

An overview of the visualizations show that all pubs have their highest customer visits in Mar 2022 and most see a gentle decreasing trend after that. By sorting customer visits and revenue, these pubs are prospering: pub ID 1344, 1342, 1343 and 1799, with pub 1344 and 5410 having predominately high visits and revenue. On the other hand, these pubs are struggling: pub ID 1443, 442, 894 and 444, with 444 having both lower visits and revenue.

show
pubs <- pubs %>%
  mutate(yearmonth = as.yearmon(checkInTime)) 
pubsv <- pubs %>%
  group_by(pubId, yearmonth) %>%
  summarise(visits = n(),
            revenue = sum(spending),
            price = mean(hourlyCost))

pub <- ggplot(pubsv, aes(x= as.factor(yearmonth), y= revenue)) +
  geom_col(fill= '#008080') +
  labs(x= 'Month Year', y= 'Number of\nCustomer\nVisits',
       title = 'Monthly Customer Visits by Pub') +
  facet_trelliscope(~ pubId, 
                    nrow = 2, ncol = 2, width = 800,
                    path = 'trellisp/',
                    self_contained = TRUE) +
  theme(axis.title.y= element_text(angle=0), 
        axis.text.x = element_text(angle = 30, vjust = 0.5, hjust = 0.3),
        axis.ticks.x= element_blank(),
        panel.background= element_blank(), 
        axis.line= element_line(color= 'grey'))
pub

3.2.2 Pubs Price and Customer Visit Duration

As the price of the pubs vary much more significantly than restaurants and is charged base on stay duration, ridge plot on the relationship of price and visit duration is examined.

It is observed from the ridge plot that generally the distribution pattern of visit duration is the same among different price range groups. All groups have the highest density of stay duration of 90-110 min, indicating that they all have a pool of steady customers. On the other hand, the higher the price of the pub, the more likely that some customers only stay for a short while (less than 20 min).

show
# bin price 
brks <- c(6, 9, 12, 15)
grps <- c('Low', 'Medium', 'High')
pubs$price_bin <- cut(pubs$hourlyCost, breaks=brks, labels = grps)

ggplot(pubs, 
       aes(x= as.numeric(visitDuration), y= fct_rev(price_bin))) +
  geom_density_ridges(rel_min_height = 0.01,
                      scale = 1,
                      colour= NA, 
                      fill = '#a0db8e',
                      alpha = 0.7) +
  labs(x = 'Duration of Customer Visits (min)',
       y = 'Hourly\nPrice',
       title = 'Density Ridge Plot of Pub Visit Duration',
       subtitle= 'Economic in Engagement, Ohio',
       caption = "Source: VAST Challenge 2022") +
  theme(panel.grid.major = element_line(color= 'grey', size = 0.1),
        panel.background= element_blank(), 
        axis.ticks.y = element_blank(),
        axis.title.y = element_text(angle = 0),
        axis.line= element_line(color= 'grey'),
        plot.caption = element_text(hjust = 0))

3.2.3 Monthly Customer Visits vs Monthly Revenue

The following animated bubble plot is created using below code chunk to review the change in trend of customer visits and revenue for each pub during the during of the study. The plot reviewed that all of the pubs have experienced a decrease in business revenue over the period. It indicates further that pub businesses with lower revenue are likely to be struggling as they tend to stay in the lower revenue zone.

show
p <- pubs %>%
  group_by(pubId, as.Date(checkInTime)) %>%
  summarise(visits = n(),
            revenue = sum(spending),
            price = mean(hourlyCost)) %>% 
  rename('date' = 'as.Date(checkInTime)') %>%
  ggplot(aes(x= visits, y= revenue,
                  size= price,
                  color= pubId)) +
  geom_point(alpha = 0.5,
             show.legend = FALSE) +
  scale_size(range= c(2, 12)) + 
  labs(title = 'Month Year: {frame_time}',
       x = 'Daily Customer Visits',
       y = 'Daily Revenue') +
  theme_classic() +
  transition_time(date) +
  ease_aes('linear')

animate(p, fps= 8)

4. Learning Points